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^^ ' Abstract 

1/^ ■ Implicit particle filters for data assimilation update the particles by 

c 2 ^ ' first choosing probabilities and then looking for particle locations that 

("V^ , assume them, guiding the particles one by one to the high probability 

domain. We provide a detailed description of these filters, with illustra- 
tive examples, together with new, more general, methods for solving the 
algebraic equations and with a new algorithm for parameter identification. 

. r* , 1 Introduction 

There are many problems in science, for example in meteorology and economics, 
in which the state of a system must be identified from an uncertain equation 
supplemented by noisy data (see e.g. [71 [T^). A natural model of this situation 
consists of an Ito stochastic differential equation (SDE): 

dx ~ f{x,t) dt + g{x,t) dw, (1) 

where x — (xi, X2, ■ ■ ■ , Xm) is an ?Ti-dimensional vector, / is an jn-dimensional 
vector function, g{x,t) is an m by ttt. matrix, and w is Brownian motion which 
encapsulates all the uncertainty in the model. In the present paper we assume 



for simplicity that the matrix g{x,t) is diagonal. The initial state a::(0) is given 
and may be random as well. 

The SDE is supplemented by measurements 6" at times f\n — 0,1,.... 
The measurements are related to the state x{t) by 

6" = /i(a;") + GW^", (2) 

where h is a, /c-dimensional, generally nonlinear, vector function with k < m, G 
is a matrix, x" — x{t"'), and W" is a vector whose components are independent 
Gaussian variables of mean and variance 1, independent also of the Brownian 
motion in equation ([1]). The independence requirements can be greatly relaxed 
but will be observed in the present paper. The task of a filter is to assimilate 
the data, i.e., estimate x on the basis of both equation ([T]) and the observations 

If the system ([T]) and equation ([2]) are linear and the data are Gaussian, the 
solution can be found in principle via the Kalman-Bucy filter (see e.g. [H]). 
In the general case, one often estimates a; as a statistic (often the mean) of a 
probability density function (pdf) evolving under the combined effect of equa- 
tions ll]) and ^. The initial state x'^ being known, all one has to do is evaluate 
sequentially the pdfs P„+i of the variables x""*"^ given the equations and the 
data. In a "particle" filter this is done by following "particles" (replicas of the 
system) whose empirical distribution at time i" approximates P„. One may for 
example (see e.g. [J [71 [HI [H]) use the pdf P„ and equation (|T]) to generate a 
prior density (in the sense of Bayes) , and then use the data 6""*"^ to generate 
sampling weights which define a posterior density Pn+i- In addition, one has 
to sample backward to take into account the information each measurement 
provides about the past. This process can be very expensive because in most 
weighting schemes, most of the weights tend to zero fast and the number of 
particles needed can grow catastrophically (see e.g. [HI [2]); various strategies 
have been proposed to ameliorate this problem. 

Our remedy is implicit sampling |5l [6]. The number of particles needed 
in a filter remains moderate if one can find high probability particles; to this 
end, implicit sampling works by first picking probabilities and then looking for 
particles that assume them, so that particles are guided efficiently to the high 
probability region one at a time, without needing a global guess of the target 
density. In the present paper we provide an expository account of particle 
filters, separating clearly the general principles from details of implementation; 
we provide general solution algorithms for the resulting algebraic equations, 
in particular for nonconvex cases which we had not considered in our previous 
publications, as well as a new algorithm for parameter identification based on an 
implicit filter. We also provide examples, in particular of nonconvex problems. 

Implicit filters are a special case of chainless/Markov field sampling meth- 
ods [21 |4j ; a key connection was made in [161 Hi] j where it was observed that 
in the sampling of stochastic differential equations, the marginals needed in 
Markov field sampling can be read off the equations and need not be estimated 
numerically. 



2 The mathematical framework 

The conditional probability density P„(x) at time i", determined by the SDE 
([l} given the observations ([2]), satisfies the recurrence relation (see e.g. [Tj): 

P„+i(a;"+i) = P„(x")P(x"+i|x")P(6"+i|x"+i)/Zo, (3) 

where P„+i(a;"+^) is the probability of the sample a;"+^ at time i"+^ given the 
observations V for j < n + 1, P„(a;") is the probability of a the sample x" at 
time i" given the observations V for j < n, P(a;"+^|a;") is the probability of a 
sample x"+^ at time i"+^ given a sample a;" at time i", P(fe"+^|x"+^) is the 
probability of the observations 6"+^ given the sample a;"+^ at time i"+^, and 
Zq is a normalization constant independent of a:" and x"+^. This is Bayes' 
theorem. 

We estimate P„+i with the help of M particles, with positions X" at time i" 
and X"^^ at time i""*"-^ (« = !,..., M), which define empirical densities P„, Pn+i 
that approximate P„, P„+i. We do this by requiring that, when a particle moves 
from Xf to Xf +^ the probability of X^+^ be 

p(xr+') = p(xf )p(xf+i|xr)p(&"+i|xr+')/^o, (4) 

where the hats have been omitted, P{Xf), the probability of X", is assumed 
given, the pdf P(Xf+^|X,"), the probability of Xf+^ given XJ", is determined 
by the SDE (P, the pdf P(6"+i|Xf +i), the probability of the observations &"+i 
given the new positions X''^'^^ , is determined by the observation equation ^. 
We shall see below that one can set P(X") = 1 without loss of generality. 

Equation (g]) defines the pdf we need to sample for each particle; this pdf is 
known, in the sense that once one has a sample, one can evaluate its probability 
(up to a constant); the difficulty is to find high probability samples, especially 
when the number of variables is large. The idea in implicit sampling is to define 
probabilities first, and then look for particles that assume them; this is done by 
choosing once and for all a fixed reference random variable, say ^, with a given 
pdf, say a Gaussian exp(— ^^^/2)/(27r)™/^), which one knows how to sample so 
that most samples have high probability, and then making X"^^ a function of 
^, a different function of each particle and each step, each function designed so 
that the map ^ — >■ X""^^ connects highly probable values of ^ to highly probable 
values of X"'*'"'^. To that end, write 

P(Xf+i|Xf )P(&"+i|Xf+i) = exp(-P,(X)), 

where on the right-hand side X is a shorthand for X"~^^ and all the other 
arguments are omitted. This defines a function Fi for each particle i and each 
time i". For each i and n, Fi is an explicitly known function oi X = X"'^^. 
Then solve the equation 

F,{X)-^,=e^/2, (5) 

where ^ is a sample of the fixed reference variable and (j)i is an additive factor 
needed to make the equation solvable. The need for (j)i becomes obvious if one 



considers the case of a linear observation function h in equation ^, so that the 
right side of equation ([5]) is quadratic but the left is a quadratic plus a constant. 
It is clear that setting (j) = niinF will do the job, but this is not necessarily 
the best choice (see below). We also require that for each particle, the function 
X"'^^ — X = X(^) defined by ^ be one-to-one so that the correct pdf is 
sampled, in particular, it must have distinct branches for ^ > and ^ < 0. The 
solution of ([5]) is discussed in the next section. From now on we omit the index 
i in both F and (f>, but it should not be forgotten that these function vary from 
particle to particle and from one time step to the next. 

Once the function X — X{£^) is determined, each value of X"+^ — X (the 
subscript i is omitted) appears with probability exp(— ^-^^/2) J~^/(7r)™/^, where 
J is the Jacobian of the map X ^ X{^), while the product P{X"+^\X")P{b"+^\X"+^) 
evaluated at X"+^ equals exp(-^^^/2) exp(-0)/(27r)'"/^. The sampling weight 
for the particle is therefore exp(— 0) J. If the map ^ — >■ X is smooth near ^ = 0, 
so that (j) and J do not vary rapidly from particle to particle, and if there is an 
easy way to compute J (see the next section), then we have an effective way 
to sample Pn+i given P„. It is important to note that though the functions F 
and <f> vary from particle to particle, the probabilities of the various samples are 
expressed in terms of the fixed reference pdf, so that they can be compared with 
each other. 

The weights can be eliminated by resampling. A standard resampling algo- 
rithm goes as follows [7]: let the weight of the i-th particle he Wi,i — 1, . . . , M. 
Define A ~ X]^i! ^o^ each of M random numbers 9 k, k — 1,...,M drawn 
from the uniform distribution on [0, 1], choose a new XJ!'^^ = X^^^ such that 

A^^ J2]=i ^j < Sk < A^^ ^7=1 ^ji ^^d then suppress the hat. This justifies 
the statement following equation fl]) that one can set P(X„) — 1. 

To see what has been gained, compare our construction with the usual 
"Bayesian" particle filter, where one samples P(X"+i|X")P(fe"+i|X"+i) by 
first finding a "prior" density Q{X"~^^) (omitting all arguments other than 
X"+i), such that the ratio W = P(X"+i)/Q(X"+i) is close to a constant, 
and then assigning to the i-th particle the importance weight W = Wi evalu- 
ated at the location of the particle. The pdf defined by the set of positions and 
weights is the density P„+i we are looking for. An important special case is the 
choice Q(X"+^) = P(X"+^|X"); the prior is then defined by the equation of 
motion alone and the posterior is obtained by using the observations to weight 
the particles. We shall refer to this special case as "standard importance sam- 
pling" or "standard filter" . Of course, once the positions and the weights of the 
particles have been determined, one should resample as above. 

The catch in these earlier constructions is that the prior density Q and the 
desired posterior can come close to being mutually singular, and the number of 
particles needed may become catastrophically large, especially when the number 
of variables m is large. To avoid this catch one has to make a good guess for 
the pdf Q, which may not be easy because Q should approximate the density 
Pn+i one is looking for- this is the basic conundrum of Monte Carlo methods, in 
which one needs a good estimate to get a good estimate. In contrast, in implicit 



sampling one does a separate calculation for each sample and there is no need 
for prior global information. One can of course still identify the pdf defined by 
the positions of the particles at time t"^-^ as a "prior" and the pdf defined by 
both the positions and the weights as a "posterior" density. 

Finally, implicit sampling can be viewed as an implicit Monte Carlo scheme 
for solving the Zakai equation [TS] , which describes the evolution of the (unnor- 
malized) conditional distribution for a SDE conditioned by observations. This 
should be contrasted with the procedure in the popular ensemble Kalman filter 
(see e.g. [S]), where a Gaussian approximation of the pdf defined by the SDE 
is extracted from a Monte Carlo solution of the corresponding Fokker-Planck 
equation, a Gaussian approximation is made for the pdf P(6"+^|a;""'"^), and 
new particle positions are obtained by a Kalman step. Our replacement of the 
Fokker-Planck equation that corresponds to the SDE alone by a Zakai equation 
that describes the evolution of the unnormalized conditional distribution does 
away with the need for the approximate and expensive extraction of Gaussians 
and consequent Kalman step. 

3 Solution of the algebraic equation that defines 
a new sample 

We now explain how to solve equation ([SJ, F{X) — <j) — C"^f/2, under several 
sets of assumptions which are met in practice. Note the great latitude this 
equation provides in linking the ^ variables to the X variables; equation ([5]) is a 
single equation that connects 2m variables (the m components of ^ and the m 
components of X) and can be satisfied by many maps ^ — >■ X; these are useful 
as long as (i) they are one-to-one, (ii) they map the neighborhood of into a set 
that contains the minimum of F, (iii) they are smooth near ^ = so that the 
weights exp(— 0) and the Jacobian J not vary unduly from particle to particle 
in the target area, and (iv) they allow the Jacobian J to be calculated easily. 
The solution methods presented here are far from exhaustive; further examples 
and refinements in the context of specific applications. 

Algorithm (A) (presented in [SJ [5]) : Assume the function F is convex 
upwards. For each particle, we set up an iteration, with iterates X"~^^'^ , j = 
0, 1, ... , {X^ for brevity), with X^ = 0, that converge to the next position X"+^ 
of that particle. The index i that identifies the particle is omitted again. We 
write the equations as if the system were one-dimensional; the multidimensional 
case was presented in detail in [B]. First we sample the reference variable ^. The 
iteration is defined when one knows how to find X^^^ given X^ . 

Expand the observation function h in equation ([2]) around X^ : 

h{X^+^) = h{X^) + {Dhy{X^+^ - X^), (6) 

where {Dhy is the derivative of h evaluated at X^ . The observation equation 
© is now approximated as a linear function of X^~^^, and the function F is the 
sum of two Gaussians in X^+^. Completing a square yields a single Gaussian 



with a remainder (/>, i.e., F{X) — {x — a)^/{2v) + (j){X^), where the parameters 
(t),a,v are functions oi X^ (this is what we called in |5] a "pseudo-Gaussian"). 
The next iterate is now X^ = a + \^^. In the multidimensional case, when 
each component of the function h in equation ([2|) depends on more than one 
variable, finding X as a function of £, may require the solution of a linear system 
of equations, which can be performed e.g. by a Choleski factorization, as in [B], 
or by a rotation, as in [5]. If the iteration converges, it converges to the exact 
solution of equation ([5]), with (j) the limit of the (j>{X^). Its convergence can 
be accelerated by Aitken's extrapolation [10]. The Jacobian J can be evaluated 
either by an implicit differentiation of equation ([5]) or numerically, by perturbing 
^ in equation (O and solving the perturbed equation (which should not require 
more than a single additional iteration step) . It is easy to see that this iteration, 
when it converges, produces a mapping ^ — >■ X that is one to one and onto. 

An important special case occurs when the observation function h is linear 
in X; it is immaterial whether the SDE ([T]) is linear. In this case the iteration 
converges in one step; the Jacobian J is easy to find; if in addition the function 
g{x, t) in equation ([T|) is independent of x, then J is independent of the particle 
and need not be evaluated; the additive term (f) can be written explicitly as a 
function of the previous position X" of the particle and of the observation 6"+^. 
We recover an easy implementation of optimal sequential importance sampling 
(see e.g. [DIZllH]). 

This iteration has been used in [B]. It may fail to converge if the function 
F is not convex (as happens in particular when the observation function h is 
highly nonlinear). One may resort then to the next construction. 

Algorithm (B). Assume the function F is /7-shaped, i.e., in the scalar 
case, it is at least piecewise differentiable, F' vanishes at a single point which 
is a minimum, F is strictly decreasing on one side of the minimum and strictly 
increasing on the other, with F{X) = oo when X = ±cx). In the ?n-dimensional 
case, assume that F has a single minimum and that each intersection of the 
graph of the function y — F{X) with a vertical plane through the minimum is 
f7-shaped in the scalar sense (note that a function may be [/-shaped without 
being convex). 

Find z, the minimum of F (note that this is the minimum of a given real 
valued function, not a minimum of a possibly multimodal pdf generated by the 
SDE; finding this minimum is not equivalent to the difficult problem of finding 
a maximum likelihood estimate of the state of the system). The minimum z 
can be found by standard minimization algorithms. 

Again we are solving the equations by finding iterates X^ that converge to 
X"+^. In the scalar case, given a sample of the reference variable ^, find first 
X" such that X'^ — z has the sign of ^, and then find the next iterates X^ by 
standard tools (e.g. by Newton iteration), modified so that the X^ are prevented 
from leaping over z. 

In the vector case, if the observation function is diagonal, i.e. each com- 
ponent of the observation is a function of a single component of the solution 
X, then the scalar algorithm can be used component by component. In more 
complicated situations one can take advantage of the freedom in connecting ^ 



toX. 

Here is an interesting example of the use of this freedom, which we present 
in the case of a muhidimensional problem where the observation function is 
linear but need not be diagonal. Set = mini^. The function F{X) — cj) can 
now be written as {X — a)'^A{X — a)/2, where a is a known vector, T denotes 
a transpose as before, and ^ is a positive definite symmetric matrix. Write 
further y ^ X — a. Equation ^ becomes 

y^Ay^ie, (7) 

where |^| is the length of the vector ^. Make the ansatz: 

y = A?7, 

where A is a scalar, r; = C/I^l is a- random unit vector and ^ is a sample of of 
the reference density. Substitution into ([7]) yields 

X'iv^AT^) = \e- (8) 

It is easy to see that Elrji'qj] — Sij/m, where E[] denotes an expected value, 
the rji are the components of 77, m is the number of variables, and Sij is the 
Kronecker delta, and hence 

E[7]'^A7]] = trace(A)/m. 

Replace equation dH) by 

A^A^ICp. (9) 

where A = trace(^)/?Ti. This equation has the solution A = |^|/a/A, and 
substitution into the ansatz leads to yi — ^i/vA, a transformation with Ja- 
cobian J = A~™/^. The difference between equations ([SJ and ^ can be 
compensated for by adding to (j> the term X^Krf^Arj) — A]. Notice now that 
as m — > 00, {rj^ Arf) -^ A (a stochastic weak law of large numbers), so that 
when the number of variables is sufficiently large, the perturbation one has to 
compensate for becomes negligible. Generalizations and applications of this 
construction will be given elsewhere in the context of specific applications. 

One can readily devise algorithms also for cases where F is not C/-shaped, 
for example, by dividing F into monotonic pieces and sampling each of these 
pieces with its predetermined probability. An alternative that is usually easier 
is to replace the non-t/-shaped function i^ by a suitable [/-shaped function Fq 
and make up for the bias by adding Fq[X) — F{X) to the additive term (/>; see 
the examples below. 

4 Backward sampling and sparse observations 

The algorithms of the previous sections are sufficient to create a filter, but ac- 
curacy may require an additional step. Every observation provides information 



not only about the future but also about the past- it may, for example, tag as 
improbable earlier states that had seemed probable before the observation was 
made; in general one has to go back and correct the past after every observation 
(this backward sampling is often misleadingly motivated solely by the need to 
create greater diversity among the particles in a Bayesian filter). A detailed 
construction has been presented in [5]; the examples in the present paper are 
simple enough so that backward sampling does not significantly enhance their 
performance, so we will be content here with presenting the construction in 
principle, without much detail; it is a straightforward extension of the work 
above. 

Consider the i-th particle, and suppose we have sampled its positions X""^, 
^n^ j^n+i ^^ times t"^^,i", t"+^. Now we would like to go back and resample 
a new position X" at time i", given X"^'^ and X^'^^ . The probability density 
oiX = X" is proportional to P{X) = P(X|X"-i)P(&"|X)P(X"+i|X). Write 
P{X) = exp(— F(X)), sample a Gaussian reference variable ^, solve F{X) — (j) = 
^'^£,/2 as above, and you are done. If need be, one can then go further back and 
resample X"~^,X"~^, . . . Note that backward sampling relates Pn+i to P„_fc 
for fc > 0. 

A similar construction can be used when the observations are sparse in 
time, for example, if the time step needed to discretize the SDE accurately is 
shorter than the time interval between observations. Suppose we have sam- 
pled X"~^, have an observation at time i"+^ but not at time i", so that 
we have to sample simultaneously X'^ and X"+^ from the SDE and the ob- 
servation 6"+^. The joint probability of X = (X",X"+^) is proportional 
to P(X"|X"-i)P(X"+i|X")P(6"+i|X"+i). Again, write this probability as 
exp(— P(X)) and equate F{X) — (/> to ^'^£,/2, where f is a 2M-dimensional ref- 
erence variable. Detailed expression for the vector case, as well as examples, 
can be found in [6]. 

5 Examples 

We now present examples that illustrate the algorithms we have just described. 
For more examples, see [5l|6]. 

We begin with a response to a comment we have often heard: "this is nice, 
but the construction will fail the moment you are faced with potentials with 
multiple wells" . This is not so- the function F depends on the nature of the 
noise in the SDE and on the function h — h{x) in the observation equation 
([2), but not on the potential. Consider for example a one dimensional particle 
moving in the potential V{x) = 2.b{x^ — 0.5)^, (see Figure 1), with the force 
f(x) = -W = -10x(a;2 - 1) and the resulting SDE dx = fix)dt + adw, 
where a = .1 and w is Brownian motion with unit variance; with this choice of 
parameters the SDE has an invariant density concentrated in the neighborhoods 
of X = ±^/T/2. We consider linear observations b'^ = x{t^) + W , where VF is a 
Gaussian variable with mean zero and variance s — .025. We approximate the 
SDE by an Euler scheme |llj with time step 5 = 0.01, and assume observations 




Figure 1: The potential in the first example. 



are available at all the points nS. The particles all start at a; = 0. We produce 
data 6" by running a single particle and adding to its positions errors drawn 
from the assumed error density in equation ([2]) , and then attempt to reconstruct 
this path with our filter. 

For the i-the particle located at time nS at Xf the function F{X) is 



F{X) ^{X~ X:r/i2a5) + [X 



>n+^2N2 



/(2.) 



which is always convex. A completion of a square yields min F = (j) — (1/2) {X^— 
6"+^)^/ ((7(5 + s); the Jacobian J is independent of the particle and need not be 
evaluated. In Figure 2 we display a particle run used to generate data and 
its reconstruction by our filter with 50 particles. This figure is included for 
completeness but both of these paths are random, their difference varies from 
realization to realization, and may be large or small by accident. To get a quan- 
titative estimate of the performance of the filter, we repeated this calculation 
10^ times and computed the mean and the variance of the difference A between 
the run that generated the data and its reconstruction at time t = 1, see Table 
I. This Table shows that the filter is unbiased and that the variance of A is 
comparable to the variance of the error in the observations s — 0.025. Note 
that even with one single particle (and therefore no resampling) the results are 
still acceptable. 

We now discuss the relation between the posterior we wish to sample and 
the prior in several special cases, including non-convex situations. We want to 
produce samples of the pdf P{x) — exp(— _F(a;))/Z, where Z is a normalization 
constant and 

F{x) = x^/{2a) + {h{x) - bf/{2s) (10) 




0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 



Figure 2: A random path (broken line) and its reconstruction by our filter (solid 
line). 



Table I 

Mean and variance of the discrepancy between the observed path and the re- 
constructed path in example 1 as a function of the number of particles M, with 
s = 0.025. 



M 


mean 


variance 


100 


-.0001 


.021 


50 


-.0001 


.022 


20 


-.0001 


.023 


10 


.0001 


.024 


5 


-.0001 


.027 


1 


-.0001 


.038 



10 



and h{x) is a given function of x (as in equation 0) and cr, s, b are given param- 
eters. This can be viewed as a the first time step in time for a filtering problem 
where all the particles start from the same point so that exp{—F{x))/Z = Pi, 
or as an analysis of the sampling for one particular particle in a general filtering 
problem, or as an instance of the more general problem of sampling a given pdf 
when the important events may be rare. In standard Bayesian sampling one 
samples the variable with pdf exp(— a;^)/(2(T))/v'27rcr and then one attaches to 
the sample at x the weight exp(— (/i(x) — 6)^/(2s)); in an implicit sampler one 
finds a sample x by solving F{x) — = £,^/2 for a suitable (j) and ^ and attaching 
to the sample the weight exp(— 0) J. For given cr, s, the problem becomes more 
challenging as |6| increases. 

In both the standard and the implicit filters one can view the empirical pdf 
generated by the unweighted samples as a "prior" and the one generated by 
the weighted samples as the "posterior" . The difficulty with standard filters is 
that the prior and posterior densities may approach being mutually singular, 
so it is of interest to estimate the Radon-Nikodym derivative of one of these 
with respect to the other. If that derivative is a constant, we have achieved 
perfect importance sampling, as every neighborhood in the sample space is 
visited with a frequency proportional to its density. We estimate the Radon- 
Nikodym derivative of the prior with respect to the posterior as follows. In this 
simple problem one can evaluate the probability of any interval with respect to 
the posterior we wish to sample by quadratures. We divide the interval [0, 1] into 
K pieces of equal lengths l/K, then find numerically points Yi, I2, • • • , Yk~i, 
with Yk — -l-oo, such that the posterior probability of the interval [— 00, Y^] is 
k/K for k = 1,2, . . . ,K. We then find L — 10^ samples of the prior and plot of 
a histogram of the frequencies with which these samples fall into the posterior 
equal probability intervals {Yk-i,Yk). The more this histogram departs from 
being a constant independent of k, the more samples are needed to calculate 
the statistics of the posterior. 

If h{x) is linear, the weights in the implicit filter are all equal and the his- 
togram is constant for all values of b. This remains true for all values of b, i.e., 
however far the observation b is from what one may expect from the SDE alone. 
This is not the case with a standard Bayesian filter, where some parts of the 
sample space that have non-zero probability are visited very rarely. In Table II 
we list the histogram of frequencies for a linear observation function h{x) = x 
and 6 = 2 in a standard Bayesian filter, with K=10. We used 10'* samples; 
the fluctuations in the implicit case measure only the accuracy with which the 
histogram is computed with this number of samples. 

As a consequence, estimates obtained with the implicit filter are much more 
reliable than the ones obtained with the standard Bayesian filter. In Table III we 
list the estimates of the mean position of the linear problem as a function of b, 
with 30 particles, cr = s = 0.1, for the standard Bayesian and the implicit filters, 
compared with the exact result. The standard deviations are not displayed, they 
are all near 0.01. 

The results in this one-dimensional problem mirror the situation with the 
example of Bickel et al. O [M] , designed to display the breakdown of the stan- 
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Table II 

Histogram of the Radon-Nikodym derivative of the prior with respect to the 
posterior, standard Bayesian fihcr vs. the imphcit fiher, 10000 particles, 6 = 2, 
(7 = s = 0.1, h(x) =^ X. 



k 


standard 


implic 


1 


.987 


.099 


2 


.006 


.108 


3 


.002 


.097 


4 


.001 


.099 


5 


.004 


.101 


6 


.003 


.099 


7 


.001 


.101 


8 


.001 


.101 


9 


.000 


.102 


10 


.000 


.093 



Table III 



Comparison of the the estimates of the means, implicit vs. standard filter, 30 
particles, together with the exact results, linear case, as explained in the text. 



b 


exact 


standard 


implicit 








-.05 


.02 


0.5 


.25 


.10 


.27 


1. 


.5 


.18 


.51 


1.5 


.75 


.23 


.76 


2. 


1. 


.26 


1.01 
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Figure 3: A non-convex function F (solid line) and a t/-sliaped substitute (bro- 
ken line). 



dard Bayesian filter when the number of dimension is large; what happens there 
is that one particle hogs almost the whole weight, so that the number of parti- 
cles needed grows catastrophically; in contrast, the implicit filter assigns equal 
weights to all the particles in any number of dimensions, so that the number of 
particles needed is independent of dimension, see also [6] . 

We now turn to nonlinear and nonconvex examples. Let the observation 



function h be strongly nonlinear: h{x) 



With a = s = 0.1; the pdf dTO 



becomes non-[/-shaped for |6| > .77. In Figure 3 we display the function F 
for 6 = 1 (the solid curve). To use the algorithms above we need a substitute 
function Fq that is [/-shaped; we also display in Figure 3 (the broken line) the 
function Fq we used; the recipe here is to link a point above the local minimum 
on the left to the absolute minimum on the right by a straight line. There are 
many other possible constructions; the only general rule is to make the minimum 
of Fq equal the absolute minimum of F, for obvious reasons. As described above, 
we solve Fo{x) — <f) = £,'^^/2 and set (j) = minFo -I- Fo{x) — F{x). It is important 
to note that this construction does not introduce any bias. The function Fq 
constructed in this way is [/-shaped but need not be convex, so that one needs 
algorithm (B) described above. In Table IV we compare the Radon-Nikodym 
derivatives of the prior with respect to the posterior for the resulting implicit 
sampling and for standard Bayesian sampling with a = s = 0.1, b = 1.5. 

The histogram for the implicit filter is no longer perfectly balanced. The 
asymmetry in the histogram reflects the asymmetry of Fq and can be eliminated 
by biasing ^, but there is no reason to do so; there is enough importance sampling 
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Table IV 

Radon-Nikodym derivatives of the prior with respect to the posterior, h(x) 
x^, a = s = 0.1, b = 1.5, 10000 samples, Fq as in the text. 



k 


standard 


explicit 


1 


.9948 


.0899 


2 


.0028 


.0537 


3 


.0011 


.0502 


4 


.0004 


.0563 


5 


.0003 


.0696 


6 


.0002 


.1860 


7 


.0001 


.1107 


8 


.0001 


.1194 


9 


.0001 


.1196 


10 


0. 


.1446 



Table V 

Comparison of the the estimates of the means, implicit vs. standard filter, 1000 

as explained in the 
text. 



th the exact result, when 


h{x) = x^ 


b exact 


standard 


implicit 


0. 0. 


-0.00 ±.01 


-.00 ±.01 


.5 .109 


.109 ±.01 


.109±.01 


1.0 .442 


.394 ± .04 


.451±.02 


1.5 .995 


.775±.09 


.995±.01 


2.0 1.18 


.875±.05 


1.18±.01 


2.5 1.30 


.895 ±.02 


1.29±.02 



without this extra step. 

In Table V we display the estimates of the means of the density for the 
two filters with 1000 particles for various values of 6, compared with the exact 
results (the number of particles is relatively large because with h(x) = x^ and 
our parameter choices the variance of the conditional density is significant, and 
this number of particles is needed for meaningful comparisons of either algorithm 
with the exact result). 

As mentioned in the previous section, there are alternatives to the replace- 
ment of _F by _Fo; the point is that for each particle the function F is an explicitly 
known non-random function, and this fact can be used in multiple ways. 
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6 Parameter identification 

One important application of particle filters is to parameter identification, where 
the SDE contains an unknown parameter and the data are used to find this 
parameter's value. One of the standard ways of doing this (see e.g [Tj) is system 
augmentation: one adds to the SDE the equation da = Q for the unknown 
parameter a, one offers a a gamut of possible values, and one relies on the 
resampling process that eliminate the values that do not fit the data. With the 
implicit filter this procedure fails, because the particles are not eliminated fast 
enough. The alternative we are proposing is finding the unknown parameter a 
by stochastic approximation. Specifically, Find a statistic T of the output of 
the filter which is a function of cr, such that the expected value E[T] vanishes 
when a has the right value a* , and then solve the equation E[T] = E[T{a)] = 
by the Robbins- Monro algorithm ^3j, in which the equation E[T] =0 is solved 
by the iteration: 

Cn+l = cr„ - a„T(cr„), (11) 

where which converges when the coefficients a„ are such that ^ a„ — > cx) while 
J2 ctn remains bounded. 

As a concrete example, consider the SDE dx — dW, where W is Brownian 
motion with variance a, discretized with time steps S, with observations 6" = 
x" +r/, where 77 is a Gaussian with mean zero and variance s. Data are generated 
by running the SDE once with the true value a* of cr, adding the appropriate 
noise, and registering the result at time nS as 6" for n — 1,2, ... ,N. For the 
functional T we choose 

T{a) = C ^(A,A,_i)/ ((E ^')(E ^-1)) '^' ' (12) 

where the summations are over i between 2 and N, A^ is the estimate of the 
increment of x in the i-the step and C is a scaling constant. Clearly if the a used 
in the filtering equals a* then by construction the successive values of A^ are 
independent and E[T] = 0.. We picked the parameters N = 100, a = 10~^, s = 
10~^, 6 = 0.01 (so that that the increment of W in one step has variance 10"^). 

Our algorithm is as follows: We make a guess cri, run the filter for N steps, 
evaluate T, and make a new guess for a using equation (jlip and ai — 1, rerun 
the filter, etc., with the On, the coefficient in equation (|11|) at the n-th step, 
equal to 1/n. The scaling factor in (fTTj) was found by trial and error: if it is too 
large the iteration becomes unstable, if it is too small the convergence is slow; 
we settled on C = 4. 

This algorithm requires that the filter be run without either resampling 
or backward sampling, because resampling and backward sampling introduce 
correlations between successive values of the A.; and bias the values of T. In a 
long run, in particular in a strongly nonlinear setting, one may need resampling 
for the filter to stay on track, and this can be done by segmentation: divide 
the run of the filter into segments of some moderate length L, perform the 
summations in the definition of T over that segment, then go back and run that 
segment with resampling, then proceed to the next segment, etc. 
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Table VI 

Convergence of the parameter identification algorithm. 

Iteration new estimate a j a* 






10. 


1 


.819 


2 


.943 


3 


1.02 


4 


1.05 


5 


1.08 


6 


1.10 


7 


1.13 


8 


1.15 


9 


1.16 


10 


1.17 


11 


1.18 


12 


1.18 


13 


1.18 



The first question is, how well is it possible in principle to reconstruct an 
unknown value of a from N observations; this issue was already discussed in 
[S]. Given 100 samples of a Gaussian variable of mean and variance a, the 
variance reconstructed from the observations is a random variable of mean a 
and variance .16 • ct; 100 observations do not contain enough information to 
reconstruct a perfectly. A good way to estimate the best result that can be 
achieved is to run the algorithm with the guess a\ equal to the exact value cr* 
with which the data were generated. When this was done, the estimate of a 
was 1.27(7*. This result indicates the achievable accuracy. 

In Table VI we display the result of our algorithm when we start with the 
starting value a\ — lOu* and with 50 particles. Each iteration requires that one 
run the filter once. 

7 Conclusions 

We have presented the implicit filter for data assimilation, together with several 
algorithms for the solution of the algebraic equations, including cases with non- 
convex functions F ^ as well as an algorithm for parameter identification. The key 
idea in implicit sampling is to solve an algebraic equation of the form F(X) — (f) = 
^"^^/2 for every particle, where the function F is explicitly known, X is the 
new position of the particle, 4> is an additive factor, and ^ is a sample of a 
fixed reference pdf; F varies from particle to particle and step to step. This 
construction makes it possible to guide the particles to the high-probability 
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area one by one under a wide variety of circumstances. It is important to note 
that the equation that Unks ^ to X is underdetermined and its solution can be 
adapted for each particular problem. 

Implicit sampling is of interest in particular because of its potential uses in 
high dimensional problems, which are only briefly alluded to in the present pa- 
per. The effectiveness of implicit sampling in high-dimensional settings depends 
on one's ability to design maps ^ — > a; that satisfy the criteria above and are 
computationally efficient. The design of such maps is problem dependent and 
we will present examples in the context of specific applications. 
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